Introduction

Description of data

Our data comes from Motivate, one of the most experienced bike-share operators; they launched Citi Bike in New York City in 2013.
They publish system data as CSV files, which can be downloaded from the link above.
In this final project, we analyze data from June 2018, since the relatively mild early-summer weather makes it a representative period when people prefer to ride bikes.
We realize that season and time of day affect people's decision to ride, so our analysis also incorporates weather information from other sources. The website provides data going back to June 2013, so we can examine time series patterns as well. For the time series analysis, we downloaded data from June 2017 to June 2018, randomly sampled 5% of each monthly dataset, and concatenated the samples. The code that combines the datasets is written in Python and is included in the GitHub repo as sampledata.py.
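The sampling-and-concatenation step can be sketched in Python (a sketch only: the real sampledata.py reads the monthly CSV files, whereas the function name and the small synthetic "months" below are purely illustrative):

```python
import pandas as pd

# sampledata.py (sketch): keep a random 5% of each monthly trip table,
# then concatenate the samples into one data frame. Two small synthetic
# "months" stand in for the real monthly CSV files here.
def sample_and_concat(monthly_frames, frac=0.05, seed=42):
    samples = [m.sample(frac=frac, random_state=seed) for m in monthly_frames]
    return pd.concat(samples, ignore_index=True)

june = pd.DataFrame({"tripduration": range(1000), "month": "201706"})
july = pd.DataFrame({"tripduration": range(1000), "month": "201707"})
combined = sample_and_concat([june, july])
print(len(combined))  # 50 rows sampled from each month -> 100
```

With the real data, each `pd.DataFrame` would come from `pd.read_csv` on a monthly trip file, and the result would be written back out as the combined CSV.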
The features include:

Analysis of data quality

Load Data

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggmap)
library(leaflet)
library(leaflet.extras)
df <- read.csv('./201806-citibike-tripdata.csv')

Definition of outliers

We use the June 2018 dataset, as mentioned in Part 2. Trip duration is recorded in seconds, which is not convenient for analysis, so we convert it to minutes by dividing by 60.

df <- df %>% mutate(minute=tripduration/60)
overalltripduration <-  ggplot(data=df,mapping=aes(x=1,y=minute)) + geom_boxplot(fill='Orange')+ggtitle('Distribution of time duration')
overalltripduration + theme(plot.title = element_text(lineheight=.8, face="bold"))

From the boxplot above, we see outliers, such as durations below 0, that may cause trouble in our analysis. We therefore define trips with a duration below 0 or above 120 minutes as outliers. Since trips longer than 120 minutes account for only about 0.3% of the data, and one does not normally ride a bike for that long, we can confidently remove them.
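The 0.3% figure is the share of trips at or above 120 minutes (or with non-positive duration). That computation can be sketched as follows (shown in Python/pandas for illustration; the helper name and the synthetic durations are ours):

```python
import pandas as pd

# Fraction of trips whose duration (given in seconds) is non-positive
# or at least 120 minutes -- the outliers as defined above.
def outlier_ratio(durations_sec):
    minutes = durations_sec / 60
    return ((minutes >= 120) | (minutes <= 0)).mean()

# Tiny synthetic example: one 150-minute trip out of four.
demo = pd.Series([300, 600, 9000, 1200])  # durations in seconds
print(outlier_ratio(demo))  # 0.25
```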

df <- subset(df,minute<120 & minute>0)
overalltripduration2 <-  ggplot(data=df,mapping=aes(x=1,y=minute)) + geom_boxplot(fill='Orange')+ggtitle('Distribution of time duration')
overalltripduration2 + theme(plot.title = element_text(lineheight=.8, face="bold"))

overalltripduration3 <- ggplot(data=df,mapping=aes(minute)) + geom_histogram(fill='Orange',color='black')+ggtitle('Distribution of time duration')
overalltripduration3 + theme(plot.title = element_text(lineheight=.8, face="bold"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Data quality

summary = df %>%
  group_by(start.station.id) %>%
  summarise(lat=as.numeric(start.station.latitude[1]),
            long=as.numeric(start.station.longitude[1]),
            name=start.station.name[1],
            n.trips=n())
leaflet(summary) %>%setView(-74.00, 40.71, zoom = 12) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addCircles(lng = ~long, lat = ~lat)
# There are two points outside New York state. Apparently those are outliers, so we remove them.
id = which(df$start.station.latitude > 45)
df = df[-id,]

Analysis of the unusually high frequency of birth year 1969

df$factor_year = as.factor(df$birth.year)
summary_18_6 = df %>% group_by(factor_year) %>%
  summarise(Freq = n()) %>%
  arrange(desc(Freq)) #1969
summary_18_6 = mutate(summary_18_6, time = rep("201806", nrow(summary_18_6)))


ggplot(summary_18_6, aes(x = factor_year, y = Freq)) + geom_point() +
  coord_flip() + xlab("user's birth year") + 
  ggtitle("Frequency of Citi Bike users' birth year in 2018/6")

One interesting observation is that the mode of users' birth year in June 2018 is 1969 (49 years old at the time). It is worth investigating that group of users, because its frequency is far higher than that of any other birth year. Thus, we looked at the datasets from 2016/6, 2017/6, 2017/12, 2018/1 and 2018/5.

data_16_6 = read.csv("./data/201606-citibike-tripdata.csv")
data_17_6 = read.csv("./data/201706-citibike-tripdata.csv")
data_17_12 = read.csv("./data/201712-citibike-tripdata.csv")
data_18_1 = read.csv("./data/201801-citibike-tripdata.csv")
data_18_5 = read.csv("./data/201805-citibike-tripdata.csv")
data_18_6 = df
data_18_5 = data_18_5 %>% mutate(minute=tripduration/60)%>%
   filter(minute<120 & minute>0)
data_17_6 = data_17_6 %>% mutate(minute=tripduration/60)%>%
   filter(minute<120 & minute>0)
data_16_6 = data_16_6 %>% mutate(minute=tripduration/60)%>%
   filter(minute<120 & minute>0)
data_17_12 = data_17_12 %>% mutate(minute=tripduration/60)%>%
   filter(minute<120 & minute>0)
data_18_1 = data_18_1 %>% mutate(minute=tripduration/60)%>%
   filter(minute<120 & minute>0)
data_18_6$factor_year = as.factor(data_18_6$birth.year)
summary_18_6 = data_18_6 %>% group_by(factor_year) %>%
  summarise(Freq = n()) %>%
  arrange(desc(Freq)) #1969
summary_18_6 = mutate(summary_18_6, time = rep("201806", nrow(summary_18_6)))


data_17_6$factor_year = as.factor(data_17_6$birth.year)
summary_17_6 = data_17_6 %>% group_by(factor_year) %>%
  summarise(Freq = n()) %>%
  arrange(desc(Freq)) #NULL
summary_17_6 = mutate(summary_17_6, time = rep("201706", nrow(summary_17_6)))

data_16_6$factor_year = as.factor(data_16_6$birth.year)
summary_16_6 = data_16_6 %>% group_by(factor_year) %>%
  summarise(Freq = n()) %>%
  arrange(desc(Freq)) #NULL
summary_16_6 = mutate(summary_16_6, time = rep("201606", nrow(summary_16_6)))
df_2month = full_join(summary_18_6, summary_17_6, by = "factor_year")
## Warning: Column `factor_year` joining factors with different levels,
## coercing to character vector
df_3month = full_join(df_2month, summary_16_6, by = "factor_year")
## Warning: Column `factor_year` joining character vector and factor, coercing
## into character vector
t1 = data.frame(factor_year = df_3month$factor_year,Freq = df_3month$Freq.x, time = df_3month$time.x)
t2 = data.frame(factor_year = df_3month$factor_year,Freq = df_3month$Freq.y, time = df_3month$time.y)
t3 = data.frame(factor_year = df_3month$factor_year,Freq = df_3month$Freq, time = df_3month$time)

tf = rbind(t1,t2,t3)
ggplot(tf, aes(x = factor_year, y = Freq, color = as.factor(time))) + geom_point() +
  coord_flip() + xlab("user's birth year") + 
  ggtitle("Frequency of Citi Bike users' birth year in different months")+
  labs(fill="Time")
## Warning: Removed 39 rows containing missing values (geom_point).

This Cleveland dot plot shows the distribution of users' birth years in 2016/6, 2017/6 and 2018/6. They are all left-skewed, which is what we expect.

data_17_12$factor_year = as.factor(data_17_12$birth.year)
summary_17_12 = data_17_12 %>% group_by(factor_year) %>%
  summarise(Freq = n()) %>%
  arrange(desc(Freq)) #NULL
summary_17_12 = mutate(summary_17_12, time = rep("201712", nrow(summary_17_12)))

data_18_1$factor_year = as.factor(data_18_1$birth.year)
summary_18_1 = data_18_1 %>% group_by(factor_year) %>%
  summarise(Freq = n()) %>%
  arrange(desc(Freq)) #1969
summary_18_1 = mutate(summary_18_1, time = rep("201801", nrow(summary_18_1)))

Time    | Top Three Birth Years
--------|----------------------
2016/6  | NA (191,500), 1985 (54,230), 1988 (52,674)
2017/6  | NA (188,803), 1989 (65,898), 1988 (65,124)
2017/12 | NA (38,928), 1985 (32,877), 1989 (32,818)
2018/1  | 1969 (39,540), 1989 (26,741), 1990 (26,724)
2018/5  | 1969 (215,754), 1988 (70,506), 1989 (70,506)
2018/6  | 1969 (218,208), 1988 (79,053), 1989 (76,515)

So it is reasonable to assume that NA birth years have been recoded to 1969 since January 2018, which explains why the frequency of 1969 in June 2018 is extremely high.

Overall, the quality of our dataset is quite good. The outliers we defined, trips with a duration over 120 minutes, account for only about 0.3% of the rows, so we can safely drop them.

Main analysis (Exploratory Data Analysis)

Analysis of the age distribution of our customers

set.seed(321)
index = sample(nrow(data_18_6), round(nrow(data_18_6)*0.001))
sample = data_18_6[index,]
g = ggplot(sample, aes(birth.year,minute))
g + geom_point(alpha = 0.5) + ylab("trip duration (in min)") +
  xlab("users' birth year") + ggtitle("Scatterplot of users' birth year and trip duration (in min) in Jun 2018")

There are 1,945,611 observations in the dataset, so we sampled 0.1% of them to examine the relationship between users' birth year and trip duration.
We expected to see an increasing and then decreasing trend, matching the distribution of users' birth years. However, there is no obvious trend in the scatterplot.

### Heatmap

## get station info
station.info <- data_18_6 %>%
  group_by(start.station.id) %>%
  summarise(lat=as.numeric(start.station.latitude[1]),
            long=as.numeric(start.station.longitude[1]),
            name=start.station.name[1],
            n.trips=n())
leaflet(station.info) %>%setView(-74.00, 40.71, zoom = 12) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addHeatmap(lng=~long, lat=~lat, intensity = ~n.trips, radius = 7, gradient = "YlGnBu")
pal = colorNumeric(
  palette="YlOrRd",
  domain = station.info$n.trips
)

leaflet(station.info) %>%setView(-74.00, 40.71, zoom = 12) %>%
  leaflet::addProviderTiles("CartoDB.Positron") %>%
  leaflet::addCircles(lng = ~long, lat = ~lat, color =~pal(n.trips)) %>%
  leaflet::addLegend("bottomright", pal = pal, value = ~n.trips, title = "Num of Trips (2018.6)")

### Analysis on weather and season

#### Join weather data to see if weather has an impact

Is it reasonable to assume that weather and season have a large impact on Citi Bike ridership? We join daily weather data to find out.

NYCweather<-read.csv("./data/NYCweather.csv")
df$Date <- as.Date(df$starttime)
NYCweather$Date<-as.Date(NYCweather$Date)
df_weather <- inner_join(df,NYCweather,by="Date")
df_weather$Severe <- as.factor(df_weather$Severe)
hist(df_weather$Date, breaks="days", freq=TRUE)

ggplot(df_weather, aes(x=Date))+
  geom_histogram(colour="white", fill="steelblue")+
  ggtitle("Number of trips by day")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(df_weather, aes(x=Severe))+
  geom_bar(colour="white", fill="steelblue")+
  ggtitle("Number of trips for severe and non-severe weather conditions")

ggplot(data = df_weather,
       aes(x = Windspeed, y = minute)) +
  scale_x_continuous("Windspeed") +
  scale_y_continuous("Average trip duration in minutes") +
  ggtitle("Trip duration vs. Windspeed") + 
  stat_summary(fun.y="mean", geom = "line", colour="steelblue", size=1) 

ggplot(data = df_weather,
       aes(x = Temperature, y = minute)) +
  scale_x_continuous("Temperature") +
  scale_y_continuous("Average trip duration in minutes") +
  ggtitle("Trip duration vs. Temperature") + 
  stat_summary(fun.y="mean", geom="line", colour="steelblue", size=1) 

ggplot(data = df_weather,
       aes(x = Precipitation, y = minute)) +
  scale_x_continuous("Precipitation") +
  scale_y_continuous("Average trip duration in minutes") +
  ggtitle("Trip duration vs. Precipitation") +
  stat_summary(fun.y="mean", geom = "line", colour="steelblue", size=1)

ggplot(data = df_weather,
       aes(x=factor(Severe), y=minute))+
  xlab("Severity of weather")+ylab("Average trip length")+
  stat_summary(fun.y="mean", geom="bar")

df_by_temp <- df_weather %>% group_by(Temperature) %>% summarise(n_distinct(starttime))
ggplot(data = df_by_temp, aes(x=as.numeric(Temperature), y=`n_distinct(starttime)`))+
  stat_summary(geom="bar", fill="steelblue")+
  xlab("Temperature")+
  ylab("Average trips per day")+
  ggtitle("Average daily trips in different temperatures")
## No summary function supplied, defaulting to `mean_se()`

df_by_windspeed <- df_weather %>% group_by(Windspeed) %>% summarise(n_distinct(starttime))
ggplot(data = df_by_windspeed, aes(x=as.numeric(Windspeed), y=`n_distinct(starttime)`))+
  stat_summary(geom="bar", fill="steelblue")+
  xlab("Windspeed")+
  ylab("Average trips per day")+
  ggtitle("Average daily trips in different windspeeds")
## No summary function supplied, defaulting to `mean_se()`

df_by_precip <- df_weather %>% group_by(Precipitation) %>% summarise(n_distinct(starttime))
ggplot(data = df_by_precip, aes(x=as.numeric(Precipitation), y=`n_distinct(starttime)`))+
  stat_summary(geom="bar", fill="steelblue")+
  xlab("Precipitation")+
  ylab("Average trips per day")+
  ggtitle("Average daily trips in different precipitations")
## No summary function supplied, defaulting to `mean_se()`

#### Interactive time series: does season have an impact on customer flow?

We combine over a year of data, from June 2017 to June 2018, into a time series to see whether season affects customer flow, visualized with dygraphs.

library(dygraphs)
library(xts)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following object is masked from 'package:leaflet':
## 
##     addLegend
## The following objects are masked from 'package:dplyr':
## 
##     first, last
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(zoo)
df_timeseries <- read.csv('./data/data_2017_2018.csv')
a <- lubridate::mdy_hms(df_timeseries$starttime)
## Warning: All formats failed to parse. No formats found.
b <- as.Date(df_timeseries$starttime)
b[is.na(b)] <- a[!is.na(a)]
df_timeseries$starttime <- b
df_timeseries <- df_timeseries %>% mutate(time = format(as.Date(starttime), "%Y-%m"))
df_timeseries <- df_timeseries[,c("time","usertype")]

df_subscriber <- df_timeseries %>% subset(usertype == 'Subscriber')
df_customer <- df_timeseries %>% subset(usertype == 'Customer')
df_subscriber <- df_subscriber %>% group_by(time) %>% summarise(Number = n())
df_customer <- df_customer %>% group_by(time) %>% summarise(Number = n())

ts_subscriber <- xts::xts(df_subscriber$Number,as.Date(as.yearmon(df_subscriber$time)))
ts_customer <- xts::xts(df_customer$Number,as.Date(as.yearmon(df_customer$time)))
ts <- cbind(ts_subscriber,ts_customer)
dygraph(ts,main = 'Impact of season',ylab = "Frequency") %>% dySeries('..1',label = 'subscriber') %>% dySeries('..2',label = 'customer') %>% dyLegend(show = "always", hideOnMouseOut = FALSE,width=400) %>%
  dyOptions(colors = RColorBrewer::brewer.pal(3, "Set2"))

### Analysis about customer flow

library(lubridate)
df_timeslot <- mutate(df,hour = hour(as.POSIXct(starttime)))
df_timeslot <- mutate(df_timeslot,wday = wday(as.POSIXct(starttime)))
df_timeslot <- df_timeslot[,c("hour","wday","minute","start.station.name","usertype")]
timeslot <- ggplot(data=df_timeslot,mapping=aes(as.factor(hour))) +geom_bar(color='Orange')+ggtitle('Time slot in trip data group by usertype')
timeslot + facet_wrap(~usertype)

From the above analysis, the overall time-slot distribution has two peaks, at 8-9 am and 5-6 pm. This is reasonable, since those are rush hours with heavy commuter flow.
However, the time-slot distribution differs considerably across user types and weekdays.
In the plots above, customers show a single peak, around 4 pm, while subscribers show two peaks.

timeslot <- ggplot(data=df_timeslot,mapping=aes(as.factor(hour))) +geom_bar(color='Orange')+ggtitle('Time slot in trip data group by weekdays')
timeslot + facet_wrap(~as.factor(wday))

The day of the week also plays a big role in the time-slot distribution. On weekdays we still see the same two peaks, but on weekends we observe only one. Thus, a pricing strategy targeted at specific weekday time slots could be worth considering.

Executive summary (Presentation-style)

The most revealing findings of our analysis concern:

* time slot
* weather & season

Interactive component

Our interactive component is written with Shiny and is published here

Conclusion

Discuss limitations and future directions, lessons learned.